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Accidents and personal injuries in Colombia 


The Accidents in Colombia project is an initiative to improve safety in the country. We will study 
the behavior of accidents and how they affect different ages and genders. In addition, we are 
going to use machine learning algorithms to identify patterns and trends in accidents, as well as 
to analyze the type of weapons involved. This information will help us create effective programs 
and policies to reduce accidents and improve safety across the country. This initiative is an 
excellent opportunity to contribute to Colombia's security and improve the lives of its citizens. 


This project consist on a analysis of data of accidents in Colombia. The goal is to find patterns 
and factors in the incidence of accidents in country. The analysis is done with Policia Nacional 
compiled data, which include districts and cities as well as number of accidents, date, gender 

and behavior. 


We will be use various types of statistical methods to iondentufy patterns and relations which 
are useful for the goberment and the public. This patterns and relaionsto be will use to make 
recomendation to improve the public policies to try to reduce the number of accidents in the 
country. 


About dataset 


In this dataset we have 1 million accidents from January 2010 to August 2022. their causes, 
weapon or means by which the event occurred. These data are from Policia Nacional and 
extracted by datos abiertos Colombia 


What we want to figure out with this analysis? 


e How many people per year, month and day have an accidents and personal injuries? 
* Which departments and boroughs with the most accidents and personal injuries? 

* Which weapons are the most used in personal injuries by gender and department? 
* When ocurrs this accicents by month day and week? 

* What gender is the most affected by the accidents? 


import libraries 


import pandas as pd 

import numpy as np 

import matplotlib.pyplot as plt 
import seaborn as sns 

import statsmodels.api as sm 


We loaded the dataset Personal Injuries and Traffic 
Accidents from the Policia Nacional 
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df = pd.read csv("C:/Users/Jorge/Downloads/Reporte Lesiones Personales y en Accidente. 


df 
DEPARTAMENTO MUNICIPIO gerens ARMAS MEDIOS Masi GENERO 
0 ANTIOQUIA GIRARDOTA 5308000 E UE 1/01/2010 FEMENINO 
1 ANTIOQUIA GIRARDOTA 5308000 Ca Mab aes 1/01/2010 MASCULINO 
2 ANTIOQUIA MUTATÁ 5480000 ee 1/01/2010 MASCULINO 
3 ANTIOQUIA NECOCLÍ 5490000 oe 1/01/2010 FEMENINO 
4 ATLANTICO ae 8001000 Peredo tan 1/01/2010 FEMENINO 
1047244 CESAR ds 20001000 VENENO 3/05/2022 MASCULINO 
1047245 HUILA OPORAPA 41503000 VENENO 16/06/2022 FEMENINO ADC 
1047246 TOLIMA IBAGUÉ (CT) 73001000 VENENO 17/04/2022 MASCULINO 
1047247 CUNDINAMARCA COTA 25214000 ali ice 30/03/2022 MASCULINO 
1047248 CUNDINAMARCA GUADUAS 25320000 AN OF 10/06/2022 MASCULINO 


ARMAS 


1047249 rows x 9 columns 


We start to understand the dataset 


e Check the date 


e Check the shape of the data 


* Review the quality of the data, verify if are null values 


e Review the format of the columns 


#revisar desde que fecha empieza y termina 
print(df['FECHA HECHO' ].min()) 
print(df['FECHA HECHO' ].max()) 
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1/01/2010 
9/12/2021 
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df .describe(include='object' ) 


DEPARTAMENTO MU 


count 1047249 


unique 32 


top CUNDINAMARCA 


freq 134439 


round(df .describe()) 


CANTIDAD 
count 1047249.0 
mean 2.0 
std 2.0 
min 1.0 
25% 1.0 
50% 1.0 
75% 1.0 
max 114.0 
df . shape 


(1047249, 9) 


df.info() 


CODIGO 
DANE 


NICIPIO 
1047249 1047249 
1023 1250 


BOGOTÁ 


ARMAS 
MEDIOS 


1047249 


45 


11001000 CONTUNDENTES 


D.C. (CT) 


61226 61226 


«class 'pandas.core.frame.DataFrame'> 
RangeIndex: 1047249 entries, © to 1047248 
Data columns (total 9 columns): 


# 


NOuUuUBPWNF OO! 


8 


Column 

DEPARTAMENTO 
MUNICIPIO 

CODIGO DANE 

ARMAS MEDIOS 

FECHA HECHO 

GENERO 

GRUPO ETARIO 
DESCRIPCION CONDUCTA 
CANTIDAD 


Non-Null Count 

1047249 non-null 
1047249 non-null 
1047249 non-null 
1047249 non-null 
1047249 non-null 
1047249 non-null 
1046285 non-null 
1047249 non-null 
1047249 non-null 


dtypes: int64(1), object(8) 
memory usage: 71.9+ MB 
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368472 


FECHA 
HECHO 


1047249 


4626 


1/01/2020 


1346 


GENERO 


1047249 


5 


MASCULINO 


592363 


GRUPO 
ETARIO 


1046285 


5 


ADULTOS 


853564 
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df.isnull().sum() 


DEPARTAMENTO 
MUNICIPIO 

CODIGO DANE 

ARMAS MEDIOS 

FECHA HECHO 

GENERO 

GRUPO ETARIO 
DESCRIPCIÓN CONDUCTA 
CANTIDAD 

dtype: int64 


In this data set, there are some null values, (not as many as | expected) but the column is a 
categorical column, so we have to fill these values with some value that we can work with. 


Data Cleaning 


df ['GENERO' ].drop duplicates() 


0 FEMENINO 
1 MASCULINO 
109 NO REPORTA 


785327 NO REPORTADO 
863052 x 


Name: GENERO, dtype: object 


dict = ('FEMENINO': 'femenino', 
'MASCULINO':'masculino', 
'NO REPORTA':'no reporta', 
"NO REPORTADO':'no reporta', 


-'i'no reporta') 
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df['GENERO'] = df['GENERO' ].replace(dict) 


df ['GENERO' ].drop duplicates() 


0 femenino 
1 masculino 
109 no reporta 


Name: GENERO, dtype: object 


df['GRUPO ETARIO'].drop duplicates() 


0 ADULTOS 
12 ADOLESCENTES 
107 MENORES 
132858 NO REPORTA 
785327 NO REPORTADO 
863052 NaN 


Name: GRUPO ETARIO, dtype: object 


df['GRUPO ETARIO'] = df['GRUPO ETARIO'].fillna('NO REPORTADO' ) 


darte ADV NOS a EOS 
"ADOLESCENTES': ' adolescentes', 
"MENORES': 'menores', 
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In [ ]: 


In [ ]: 


In [ ]: 


In [ ]: 


"NO REPORTA': 'no reporta', 
"NO REPORTADO ':'no reporta' } 
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rename dict = ('DEPARTAMENTO' :' departamento', 


'MUNICIPIO': 'municipio', 


"ARMAS MEDIOS' 


"FECHA HECHO':'fecha hecho', 'DESCRIPCIÓN CONDUCTA':'descripción conducta', 
'CANTIDAD':'cantidad', 'GENERO':'genero', GRUPO ETARIO':'grupo etario') 


"ARMA DE FUEGO 


' CONTUNDENTES ' : 


'MOTO' 

'NO REPORTA 

'POLVORA(FUEGOS PIROTECNICOS) 
' PUNZANTES 


Welle 


"COMBUSTIBLE 
' JERINGA 
' PERRO 


IBICTELENAS: 


"ARTEFACTO EXPLOSIVO/CARGA DINAMITA 


"MINA ANTIPERSONA' : 
"SUSTANCIAS TOXICAS': 


"SIN EMPLEO DE ARMAS' 


"AGUA CALIENTE! : 
"ESCOPOLAMINA' : 


"OLLA BOMBA" 


"GRANADA DE MANO" : 
"PAQUETE BOMBA": 


' MEDICAMENTOS ' 

"VENENO" 

' QUIMICOS ' 

' CARRO BOMBA' 

GASES: 

' CINTAS/CINTURON ' : 
"ARTEFACTO INCENDIARIO' 
"PAPA EXPLOSIVA' 
"ALIMENTOS VENCIDOS': 
"LICOR ADULTERADO' 
"ACIDO' 

"ALUCINOGENOS ' 
"ALMOHADA" 

'BOLSA PLASTICA' 
"CORTANTES" 
"CUCHILLA" 

"DIRECTA" 

"ARMAS BLANCAS ' 
"PRENDAS DE VESTIR' 
'CILINDRO BOMBA' 

"NO REPORTADO' 
'CINTURON BOMBA' 
"ARMA TRAUMATICA' 


df['GRUPO ETARIO'].drop_duplicates() 
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armas dict = ('ARMA BLANCA / CORTOPUNZANTE' : 'cortopunzante', 


:'arma de fuego', 
'contundentes', 
:'vehiculo', 

'i'no reporta', 
:'explosivos', 
:'cortopunzante', 
'vehiculo', 
:'combustible', 
:'material medico', 
:'animales', 
'vehiculo', 
:'explosivos', 
'explosivos', 
'sustacias tóxicas', 
:'sin armas', 
'casero', 

'sustancias tóxicas', 
:'explosivos', 
'explosivos', 
'explosivos', 
:'material medico', 
:'sustancias tóxicas', 
:'sustancias tóxicas', 


:'explosivos', 


'sustancias tóxicas', 
'materiales', 


:'explosivos', 
:'explosivos', 


'sustancias tóxicas', 


:'sustancias tóxicas', 
cad Gd Oleg, 

:'sustancias tóxicas', 
:'materiales', 
:'materiales', 
:'cortopunzante', 
:'cortopunzante', 
:'materiales', 
:'cortopunzante', 
:'materiales', 
:'explosivos', 

-'i'no reporta', 

:'no reporta', 
:'explosivos', 
:'contundentes') 


df['GRUPO ETARIO'] = df['GRUPO ETARIO'].replace(dict 1) 


5/50 


12/4/23, 21:26 colombian_acc 


0 adultos 
12 adolescentes 
107 menores 
132858 no reporta 


Name: GRUPO ETARIO, dtype: object 
df['FECHA HECHO'] = pd.to_datetime(df['FECHA HECHO'], format="%d/%m/%Y' ) 


df.columns 


Index(['DEPARTAMENTO', 'MUNICIPIO', 'CODIGO DANE', 'ARMAS MEDIOS', 
'FECHA HECHO', 'GENERO', 'GRUPO ETARIO', 'DESCRIPCIÓN CONDUCTA', 
' CANTIDAD ' ], 
dtype='object') 


df = df.rename(columns-(rename dict)) 


df['armas medios'].drop duplicates() 
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832 
1030 
1116 
1331 
2034 
2249 
2288 
2647 
2648 
2782 
2784 
2994 
3726 
4188 
14611 
26425 
31949 
35676 
37359 
45171 
68877 
74904 
76346 
187488 
201235 
201309 
208798 
449553 
454954 
601433 
706864 
862859 
864668 
942155 
960009 
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ARMA BLANCA / CORTOPUNZANTE 
ARMA DE FUEGO 
CONTUNDENTES 

MOTO 

NO REPORTA 
POLVORA(FUEGOS PIROTECNICOS) 
PUNZANTES 

VEHICULO 
COMBUSTIBLE 

JERINGA 

PERRO 

BICICLETA 

ARTEFACTO EXPLOSIVO/CARGA DINAMITA 
MINA ANTIPERSONA 
SUSTANCIAS TOXICAS 
SIN EMPLEO DE ARMAS 
AGUA CALIENTE 
ESCOPOLAMINA 

OLLA BOMBA 

GRANADA DE MANO 
PAQUETE BOMBA 
MEDICAMENTOS 
VENENO 

QUIMICOS 

CARRO BOMBA 

GASES 
CINTAS/CINTURON 
ARTEFACTO INCENDIARIO 
PAPA EXPLOSIVA 
ALIMENTOS VENCIDOS 
LICOR ADULTERADO 
ACIDO 

ALUCINOGENOS 
ALMOHADA 

BOLSA PLASTICA 
CORTANTES 

CUCHILLA 

DIRECTA 

ARMAS BLANCAS 
PRENDAS DE VESTIR 
CILINDRO BOMBA 

NO REPORTADO 
CINTURON BOMBA 

ARMA TRAUMATICA 


Name: armas medios, dtype: object 


df['armas medios'] 


df['departamento'] 


df['municipio'] = df['municipio'].str.lower() 


df['descripción conducta'].drop duplicates() 


0 
384 


LESIONES PERSONALES 
LESIONES CULPOSAS ( EN ACCIDENTE DE TRANSITO ) 


Name: descripción conducta, dtype: object 
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df['armas medios'].replace(armas dict) 


df['departamento'].str.lower() 


7/50 


12/4/23, 21:26 


colombian_acc 


desc_dict = 


('LESIONES PERSONALES':'lesiones personales’, 


df['descripción conducta'] = df['descripción conducta'].replace(desc dict) 


df['descripción conducta'].drop duplicates() 


2) lesiones personales 
384 lesiones culposas 
Name: descripción conducta, dtype: object 


df = df.drop(columns=[ 'CODIGO DANE']) 
df 


departamento municipio armas medios 
0 antioquia girardota  cortopunzante 
1 antioquia girardota  cortopunzante 
2 antioquia mutatá ` cortopunzante 
3 antioquia necoclí cortopunzante 
NE barranquilla 
4 atlántico cortopunzante 
(ct) 
valledupar sustancias 
1047244 cesar Sé 
(ct) toxicas 
e sustancias 
1047245 huila oporapa Se 
toxicas 
, e sustancias 
1047246 tolima  ¡bagué (ct) jos 
tóxicas 
1047247  cundinamarca cota sin armas 
1047248 ` cundinamarca guaduas sin armas 


1047249 rows x 8 columns 


df.info() 


«class 'pandas.core.frame.DataFrame'> 
RangeIndex: 1047249 entries, @ to 1047248 
Data columns (total 8 columns): 


+ — Column Non-Null Count 

© departamento 1047249 non-null 
1 municipio 1047249 non-null 
2 armas medios 1047249 non-null 
3 fecha hecho 1047249 non-null 
4 genero 1047249 non-null 
5 grupo etario 1047249 non-null 
6 descripción conducta 1047249 non-null 
7 cantidad 1047249 non-null 


fecha hecho genero 
2010-01-01 femenino 
2010-01-01 masculino 
2010-01-01 masculino 
2010-01-01 femenino 
2010-01-01 femenino 
2022-05-03 masculino 
2022-06-16 femenino 
2022-04-17 masculino 
2022-03-30 masculino 
2022-06-10 masculino 
Dtype 

object 

object 

object 
datetime64[ns] 
object 

object 

object 

int64 


dtypes: datetime64[ns](1), int64(1), object(6) 


memory usage: 63.9+ MB 
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grupo etario 
adultos 
adultos 
adultos 


adultos 


adultos 


adultos 


adolescentes 


adultos 


adultos 


adultos 


"LESIONES CULPOSAS ( EN ACC] 


descripciór 
lesiones 
lesiones 
lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 
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df.isnull().sum() 


departamento 
municipio 

armas medios 

fecha hecho 

genero 

grupo etario 
descripción conducta 
cantidad 

dtype: int64 


Oo o Go o o GO o 
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df = df[['fecha hecho','departamento', municipio','armas medios','genero','grupo etari 


df 
fecha hecho 
O 2010-01-01 
1 2010-01-01 
2 2010-01-01 
3 2010-01-01 
4 2010-01-01 
1047244 2022-05-03 
1047245 2022-06-16 
1047246 — 2022-04-17 
1047247 2022-03-30 
1047248 2022-06-10 


departamento 
antioquia 
antioquia 
antioquia 


antioquia 


atlántico 


cesar 


huila 


tolima 


cundinamarca 


cundinamarca 


1047249 rows x 8 columns 


municipio 
girardota 
girardota 
mutatá 
necocli 


barranquilla 
(ct) 


valledupar 


(ct) 


oporapa 


ibagué (ct) 


cota 


guaduas 


armas medios 
cortopunzante 
cortopunzante 
cortopunzante 


cortopunzante 


cortopunzante 


sustancias 
tóxicas 


sustancias 
tóxicas 


sustancias 
tóxicas 


sin armas 


sin armas 


At this point, the dataset is more organized and standarized. 


EDA 


In this project, we are performing exploratory data analysis (EDA) on a dataset in order to 


genero 


femenino 


masculino 


masculino 


femenino 


femenino 


masculino 


femenino 


masculino 


masculino 


masculino 


grupo etario 
adultos 
adultos 
adultos 


adultos 


adultos 


adultos 


adolescentes 


adultos 


adultos 


adultos 


descripciór 
lesiones 
lesiones 
lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


lesiones 


extract useful information and insights. To do this, we have defined several functions that allow 


us to create new columns based on existing data in the dataset. 


For example, we may have defined a function to extract the year from a date column, or a 


function to extract a particular substring from a text column. By creating these new columns, we 
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can gain new insights into the data and answer questions that were previously difficult or 
impossible to answer. 


In addition to defining these functions, we are also using various data visualization techniques 
to explore the data and identify patterns or trends. We may be creating histograms, scatterplots, 
or other types of plots to help us better understand the relationships between different 
variables in the dataset. 


Overall, the goal of this EDA project is to gain a deeper understanding of the dataset and the 
underlying processes that generated it. By doing so, we can make more informed decisions and 
identify opportunities for improvement or optimization. 


Anwering the question of accidents by month | define the function 'MES' to obtain the result, 
subsequently | will plot the result and we can see how is the behavior of the accidents with a 
line chart. 


# GHistogram 

sns.histplot(df, x='departamento', bins-20, kde=True) 
plt.title( ‘Distribution of Departament ') 
plt.xticks(rotation-90) 

plt.show() 
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Distribution of Departament 
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The histogram reveals an interesting distribution of the data, showing a multimodal pattern in 
the department variable, indicating the presence of several modes in the distribution. The graph 
also suggests that there are certain departments that occur more frequently than others, 
leading to the multiple peaks in the histogram. This suggests that there may be underlying 
factors that contribute to the frequency of accidents in specific departments, which could be 
further explored in the analysis. Overall, the histogram provides valuable insights into the 
distribution of accidents across departments and highlights areas that require further 
investigation. 


In [ ]: fig, (axi, ax2) = plt.subplots(1, 2, figsize=(12, 5)) 
sns.histplot(data-df, x="cantidad", ax=ax1, kde-True) 
ax1.set_xlim(@, 15) 


avl. set title("Distribution of quantity of accidents and personal injuries") 


sns.histplot(data-df, x-"genero", ax-ax2, kde-True) 
ax2.set title("Gender distribution of victims") 


plt.tight layout() 
plt.show() 
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4e7 Distribution of quantity of accidents and personal injuries 1e6 Gender distribution of victims 
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These histograms reveal the distribution of the "cantidad" and "genero" variables. We can 
observe a right-skewed distribution in the "cantidad" histogram, indicating that the most 
common value is 1, followed by 2. In the "genero" histogram, we can see several peaks that 
make the graph multimodal. This suggests that there are multiple modes in the gender 
distribution, which could indicate some underlying patterns or factors affecting the distribution. 
Further analysis is necessary to fully understand these patterns and their implications. 


In [ ]: ax = sns.catplot(data-df, x='departamento', y='cantidad', kindz'bar', aspect-2.5) 
ax.set(title-'Cantidad de accidentes por departamento', xlabel-'Departamento', ylabel: 
ax.set xticklabels(rotation-90) 
plt.show() 
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This graph shows the number of accidents by department, where we can see that Cundinamarca 
is the department with the highest number of accidents, well above the other departments, 
which are quite similar to each other, and there is not much difference between them. It is 
interesting to note that this information could be useful for authorities and policymakers to 
allocate resources and take measures to reduce accidents, especially in Cundinamarca. Further 
analysis could also be done to explore the possible reasons why Cundinamarca has a higher 
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number of accidents compared to other departments, such as the road infrastructure, 
population density, and economic activities. Overall, this graph provides valuable insights into 
the distribution of accidents by department, which could help to improve safety in Colombia. 


def MES(df): 


Group accidents by month 


Arguments: 
“df”: A pandas DataFrame 


Outputs: 
^monthly accidents': The grouped Series 


# YOUR CODE HERE 

df['fecha hecho'] = pd.to datetime(df['fecha hecho']) 
df['mes'] = df["fecha hecho"].dt.to period('M') 
monthly accidents - df.groupby("mes").size() 

return monthly accidents 


MES(df).plot.line() 
plt.title('Accidentes por mes a lo largo de los años') 
plt.xlabel('Años') 


Text(0.5, 0, 'Afios') 


Accidentes por mes a lo largo de los anos 
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In recent years, accidents have increased significantly, not only in traffic but also in other areas 
such as the use of bladed weapons or firearms, contact with corrosive acids, among others. This 
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means that more and more people are suffering the tragic effects of accidents, ranging from 
serious injuries, permanent disability to death. The year 2020 was an exception to this trend, 
due to the Covid-19 pandemic, which caused a decrease in the number of accidents worldwide. 
However, early 2021 and 2022 have seen a significant increase in the number of accidents, 
although they still remain below the numbers of the years prior (to the pandemic). This shows 
us that there is still a significant risk of suffering an accident, either from traffic, the use of 
weapons or contact with corrosive acids, so it is important that we all take the necessary 
measures to prevent these accidents, such as being more aware when driving, comply with 
speed limits, do not drive under the influence of alcohol or drugs, among other measures. This 


will help to reduce the number of victims and prevent unnecessary accidents. 


Now let’s move on to the behavior for the days of the week. 


def DIA(df): 


Group accidents by day of the week 


Arguments: 
“df” : A pandas DataFrame 


Outputs: 
^weekday accidents': The grouped Series 


# YOUR CODE HERE 

df['dia semana'] = pd.to datetime(df['fecha hecho']).dt.weekday 
weekday accidents - df.groupby(["dia semana"]).size() 

return weekday accidents 


DIA(df).plot.bar() 
plt.title('Accidents per day of the week') 


Text(0.5, 1.0, 'Accidents per weekday') 
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Accidents per weekday 
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Looking at the bar plot, we can notice a clear pattern: on Mondays, accidents are slightly higher 
than the rest of the days of the week, while weekends (from Saturday onwards) start to show a 
significant increase in the number of accidents and personal injuries, with Sunday being the day 
with the highest accident rate. This leads us to conclude that weekends represent a higher risk 
of suffering an accident. This could be explained by the excesses to which some people subject 
themselves during the weekend, such as substance abuse, fights, drunk driving, among others. 
This is a situation that should be taken into account to prevent these accidents and minimize 
their impact. Measures should be taken such as the implementation of awareness campaigns on 
the risks of drunk driving, substance abuse and the use of weapons. Stricter controls should also 
be implemented to prevent excessive use of substances, use of weapons, and speeding. These 
measures will help prevent these accidents and save lives. 


In [ ]: def DEPARTAMENTO(df): 


Group accidents by borough 


Arguments: 
"df" : A pandas DataFrame 


Outputs: 
“boroughs' : The grouped Series 


# YOUR CODE HERE 

df['departamento'].drop duplicates() 

boroughs - df.groupby(['departamento']).size() 
return boroughs 
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In [ ]: DEPARTAMENTO(df).plot.bar(color='purple' ) 


out! J: <AxesSubplot: xlabel='departamento' > 
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Looking at the graph above, we can see that the Colombian departments with the highest 
accident rate are Cundinamarca, Valle, Antioquia, Santander and Tolima. This leads us to 
conclude that these five departments, despite representing only 2096 of the Colombian 
population, account for almost 4096 of the accidents, which shows that the risk of suffering an 
accident is higher in these departments. This can be explained by the lack of adequate 
infrastructure for transportation, lack of awareness of drivers, speeding, substance abuse, 
handling and carrying weapons, lack of citizen awareness, among other factors. These are risks 
that must be taken into account to prevent these accidents and save lives. In addition, it is 
necessary for local governments to implement awareness campaigns on the risks of drunk 
driving, substance abuse, and the carrying of firearms and weapons, to promote a culture of 
tolerance, as well as stricter controls to minimize the number of accidents. 


def MUNICIPIO(df): 


Group accidents by borough 
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Arguments: 
“df: A pandas DataFrame 


Outputs: 
“boroughs : The grouped Series 


# YOUR CODE HERE 

df['municipio'].drop duplicates() 

boroughs - df.groupby(['municipio']).size() 

boroughs - boroughs.sort values(ascending-False).head(15) 
return boroughs 


in | |: MUNICIPIO(df).plot.bar(color-'orange') 


out[ ]: <AxesSubplot: xlabel='municipio' > 
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The bar chart above shows the municipalities in Colombia with the highest number of accidents 
and personal injuries. The chart indicates that the top five municipalities with the highest 
number of accidents and personal injuries are Bogota D.C., Cali, Medellin, Bucaramanga, and 
Barranquilla. 
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the bar chart provides a high-level overview of the municipalities with the highest number of 
accidents and personal injuries in Colombia. This information can be used to develop targeted 


interventions to reduce the number of accidents and injuries in these areas, ultimately leading 
to a safer and more secure society for all. 


def MES DEPAR(df): 


Calculate accidents per hour for each borough 


Arguments: 
“df” : A pandas DataFrame 


Outputs: 


"bor hour': A Series. This should be the result of doing groupby by borough 
and hour. 


# YOUR CODE HERE 
df['mes'] = pd.to datetime(df['fecha hecho']).dt.to period('M') 
bor hour- df.groupby(["departamento","mes"]).size() 


return bor hour 


fig, ax = plt.subplots(figsize-(15,5)) 
ax - MES DEPAR(df).plot() 
plt.xticks(rotation-90) 

plt.show() 
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The line chart above shows the trends of accidents and personal injuries over time, specifically 
focusing on the year with the highest number of cases in the Department of Córdoba. The chart 
indicates that the year with the highest number of accidents and personal injuries in the 
Department of Córdoba was 2012. 


def FACTORES(df) : 


Finds which 6 factors cause the most accidents, without 
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In [ ]: 


Out[ ]: 


colombian acc 


double counting the contributing factors of a single accident. 


Arguments: 
^contrib dr A pandas DataFrame. 


Outputs: 

"factors most acc': A pandas DataFrame. It has only 10 elements, which are, 
sorted in descending order, the contributing factors with the most accidents. 
The column with the actual numbers is named “index. 


# YOUR CODE HERE 


contrib df - pd.melt(df.reset index(),id vars -"index", value vars- 'armas medios' 
contrib df- contrib df.drop(columns-['variable']) 

contrib df = contrib df.drop duplicates(keep-'first') 

factors most acc = contrib df.groupby('value').count().sort values(by-'index', asc 
factors most acc- factors most acc.head(10) 


return factors most acc 


FACTORES (df).plot.bar(color-'purple') 


«AxesSubplot:xlabel-'value'» 
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The bar chart above displays the 10 most common factors associated with accidents in 
Colombia. According to the chart, the most frequent factor is blunt objects, which suggests that 
most frequent personal injurie in Colombia result from one person striking another with a non- 
cutting weapon. The second most common factor is vehicular accidents, which is not surprising 
given the high number of cars and motorcycles in the country. The third most frequent factor is 
sharp-edged weapons such as knives and machetes. 


def CONDUCTA(df): 


Finds which 6 factors cause the most accidents, without 
double counting the contributing factors of a single accident. 


Arguments: 
^contrib df': A pandas DataFrame. 


Outputs: 

^contrib df: A pandas DataFrame. It has only 6 elements, which are, 

sorted in descending order, the contributing factors with the most accidents. 
The column with the actual numbers is named “index”. 


# YOUR CODE HERE 
contrib df = pd.melt(df.reset index(),id vars ="index", value vars- 'descripción « 
contrib df- contrib df.drop(columns-['variable']) 
contrib df = contrib df.drop duplicates(keep-'first') 
conduct acc = contrib df.groupby('value').count().sort values(by-'index', ascendir 
return conduct acc 

CONDUCTA(df) .plot(kind='bar' ) 


<AxesSubplot: xlabel='value'> 


file:///C:/Users/Jorge/Downloads/Projects/Colombia_acc/colombian_acc.html 20/50 


12/4/23, 21:26 colombian_acc 


700000 


600000 


500000 


400000 


300000 


200000 


100000 


lesiones culposas 


Vi 
E 
m 
c 
o 
E 
E" 
a 
Vi 
a 
E 
2 
Vi 
A 


value 


Based on the information provided, it appears that the majority of the cases are accidents 
caused by either the injured person's carelessness or unintentional harm caused by another 
person. However, there is a significant percentage of cases that are caused intentionally by 
other individuals with the intention to harm or even take the victim's life. Additionally, some 
cases may reflect attempted suicides, although it is unclear from the given information. 


It is important to note that speculation without sufficient evidence can be misleading and 
potentially harmful. Therefore, it is necessary to further investigate and analyze the data to 
accurately determine the causes of these accidents and take appropriate measures to prevent 
them. 


in [ ]: contingency = pd.crosstab(columns=df['armas_medios'],index=df['departamento']) 
contingency = contingency 
contingency 
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armas_medios 
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The previous heat map shows that regardless of the department in Colombia, the main causes 
of accidents and personal injuries are blunt and sharp-edged weapons, as well as vehicular 
accidents. This information can be used to develop policies and measures to prevent such 
accidents. However, a more in-depth study is required to determine whether the causes are 
intentional, such as crimes or attempted homicides, or unintentional accidents without any 
intent to harm. 
Although the data does not reveal the cause of the accidents, we can infer that street fights and 
robbery attempts are the most common reasons for such incidents. It is important to note that 
speculations without sufficient evidence can be misleading and potentially harmful. Therefore, 
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further investigation and analysis of the data are necessary to determine the root cause of these 
accidents accurately. 


Based on the heat map, it is evident that there is a need for more focused efforts to prevent the 
use of blunt and sharp-edged weapons in crimes and reduce the number of vehicular accidents. 
This information can be used to develop targeted policies to improve public safety and reduce 
the number of accidents and injuries in the country. 


con df = pd.melt(contingency.reset_index(), id vars -['departamento'], value vars zl: 


'arma de fuego' ,'casero', 'combustible', 

'contundentes', 'cortopunzante', 'explosivos', 'material medi 
'sin armas', 'sustacias tóxicas', 'sustancias tóxicas', 

'vehiculo', 'ácido'], 


var name ='medios', value name ='values') 
con df 


departamento medios values 


0 amazonas animales 18 
1 antioquia animales 154 
2 arauca animales d 
3 atlantico animales 30 
4 bolivar animales 2 
475 sucre acido 515 
476 tolima acido 73 
477 valle acido 64 
478 vaupés ácido 1 
479 vichada ácido 7 


480 rows x 3 columns 


sns.boxplot(contingency) 
plt.xticks(rotation-90) 
plt.show() 
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in [ ]: sns.barplot(data=contingency) 
plt.xticks(rotation=90) 
plt.show() 
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The previous charts confirm what the heatmap revealed earlier. Here we can observe the 
distribution of the most frequent causes of accidents and personal injuries throughout the 
entire country. As previously mentioned, blunt and sharp-edged weapons, as well as vehicular 
accidents, are the most common causes in most departments. 


In [ ]: def armas gen(df): 


df: pandas dataframe 


arguments: 
output: pandas dataframe, it has only 3 columns 
gender, means to make accident and quantity 


contingency 2 = pd.crosstab(index-df['genero'],columns-df['armas medios']) 
d f = pd.melt(contingency 2.reset index(), id vars -['genero'], value vars =['anin 


'arma de fuego' ,'casero', 'combustible', 
'contundentes', 'cortopunzante', 'explosivos', 'material medi 
'sin armas', 'sustacias tóxicas', 'sustancias tóxicas', 
'vehiculo', MeKealeloy™ lz 
var name -'medios', value name ='cantidades' ) 


return d f 


file:///C:/Users/Jorge/Downloads/Projects/Colombia acc/colombian acc.html 26/50 


12/4/23, 21:26 colombian_acc 


in | ]: armas gen(df).head() 


Out[ ]: genero medios cantidades 
0 femenino animales 904 
1 masculino animales 908 
2 noreporta animales 216 
3 femenino arma de fuego 6927 
4 masculino arma de fuego 34929 


in [ ]: sns.barplot(data-armas gen(df), x='medios',y='cantidades' ,hue='genero' ) 
plt.xticks(rotation-90) 
plt.title('Cantidades en armas usadas y genero') 
plt.xlabel( 'Medios') 
plt.ylabel('Cantidades') 


plt.show() 
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The previous graph shows the relationship between the gender of the victim and the type of 
weapon used in causing the injury. It is evident that adult males are the most affected by both 
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vehicular accidents and personal injuries caused by sharp-edged and blunt objects. However, it 
is interesting to note that blunt objects affect women to a much greater extent than sharp- 
edged weapons, suggesting a high level of female involvement in violence. This could be 
attributed to domestic violence, which is a common cause of injury among women. Further 
analysis is needed to determine the root causes of violence against women in Colombia and to 
develop effective policies to prevent it. 


In | ]: sns.barplot(data=df,x='grupo_etario',y='cantidad',hue='genero') 
plt.title('Accidentes por genero y grupo etario') 
plt.xlabel('Edades') 
plt.ylabel('Cantidad (miles)') 


Out[ ]: Text(0, 0.5, 'Cantidad (miles)') 
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Looking at the bar graph, it can be seen that accidents affect men and women in a very similar 
way, with adults being the most affected group. Although data on the gender and age of those 
affected are not reported, we know that these accidents also affect children and adolescents, 
which represents a large number of people who have suffered some kind of accident. This leads 
us to conclude that the factors contributing to the number of accidents are not limited to a 
single gender or age, but are many and varied. Among them, we can highlight inappropriate 
driving behavior, such as driving under the influence of alcohol, speeding, substance abuse, use 
of weapons, among others. These factors, together with the lack of awareness and respect for 
traffic rules, contribute to the increase in accidents shown in the graph. Therefore, it is necessary 
that we all become aware of the dangers and risks involved in not respecting the rules, in order 
to prevent these accidents and save lives. 
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In [ ]: sns.set style("whitegrid") 
plt.figure(figsize-(15, 8)) 
ax = sns.barplot(data=df, x="departamento", y='cantidad',hue="genero", palette="muted' 
ax.set title('Cantidad de accidentes por departamento y género') 
ax.set xlabel('Departamento') 
ax.set ylabel('Cantidad') 
ax.set xticklabels(ax.get xticklabels(), rotation-90) 
plt.legend(loc-'upper right') 
plt.show() 
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Analyzing the bar chart, we can clearly observe that the department of Cundinamarca has the 
highest number of accidents and personal injuries in Colombia. Although women are more 
affected, the difference between genders is not statistically significant. However, it is concerning 
that there are a significant number of accidents where the gender of the victim was not 
reported. This makes it difficult to determine the true gender distribution of the victims, which is 


crucial information for designing effective interventions. 


We can also see that the number of accidents and personal injuries is quite high, which is 
alarming. This could be attributed to various factors such as the high rate of violence in Bogota 
city or poor road safety measures. It is important to consider the underlying causes of these 
accidents to develop more effective prevention strategies. 


Overall, the data provides valuable insights into the current situation in Colombia regarding 
accidents and personal injuries. However, further analysis is required to gain a deeper 
understanding of the issue and to design effective interventions to reduce the number of 
accidents and injuries. 
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df = df.set index('fecha hecho').reset index() 


df 
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sns.boxplot(data-con) 
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plt.xticks(rotation-90) 


plt.title('Quantity by behavior') 


plt.xlabel('Behavior') 
plt.ylabel('Quantity') 
plt.show() 
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pd.crosstab(index-df['grupo etario'],columns- df['descripción conducta']) 
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Quantity by behavior 
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Behavior 


Personal injuries are a significant part of accidental injuries, which suggests that it is more 
common for injuries to be reported as having no apparent intention to cause harm than the 
opposite. This may indicate that a large number of injuries are due to negligence or 
carelessness rather than intentional harm. It also highlights the importance of prevention 
strategies and safety measures to reduce the number of accidents and personal injuries that 
occur. Furthermore, accurate reporting of the causes of injuries is essential to develop effective 
policies and programs to prevent and reduce the incidence of personal injuries. 


in [ ]: con melted = pd.melt(con.reset index(), id vars -['grupo etario'], value vars =['lesic 
var name -'causas', value name -'value') 


con melted 
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Out [ 


In [ 


Ts 


IE 
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57920 


25284 


249258 


24112 


55 


colombian acc 


sns.barplot(data-con melted, x-'causas',y-'value',hue-'grupo etario') 


plt.title('Cantidades en tipo de comportamiento y grupo etario') 


plt.xlabel('Behavior') 
plt.ylabel('Quantities') 
plt.show() 


Quantities 


600000 


500000 


400000 


300000 


200000 


100000 


Cantidades en tipo de comportamiento y grupo etario 


lesiones personales 


Behavior 


grupo etario 
adolescentes 
adultos 
menores 

no reporta 


lesiones culposas 


The personal injuries data shows a significant gender gap, with men being affected more than 


women. This could be due to the fact that men are more exposed to heavy or high-risk jobs, or 


they are simply more likely to take risks than women. In terms of intentional or unintentional 


injuries caused by others, it could also be due to their greater involvement in street violence. 
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It's worth noting that these gender differences in personal injuries are not absolute and can vary 
depending on the type of injury and the context in which it occurs. However, this data does 
suggest that there may be certain gender-specific factors that contribute to personal injury 
rates. This underscores the need for gender-sensitive policies and interventions aimed at 


preventing and addressing personal injuries, particularly among men. 


df.to csv("C:/Users/Jorge/Downloads/Projects/colombian acc.csv",encoding = 'utf-8') 


KMEANS 


For this dataset We have decided to use KMeans algorithm to cluster the data and understand 
the performance of each group. 


# Import Libraries for create the model 

import base64 

from pylab import rcParams # For the size of plots 

from sklearn import preprocessing # Library to transfor the data 


# Libraries for the model 

from sklearn.cluster import KMeans 

from sklearn.metrics import f1 score 

from sklearn.cluster import KMeans 

from sklearn.preprocessing import StandardScaler 


We drop the columns we don't need. 


df1 = df.copy() 
df1 = df1.drop(columms=['fecha_hecho','mmunicipio','mes','dia_semana']) # In order to 
dfi 
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departamento 

0 antioquia 

1 antioquia 

2 antioquia 

3 antioquia 

4 atlantico 
1047244 cesar 
1047245 huila 
1047246 tolima 


1047247  cundinamarca 


1047248 ` cundinamarca 


armas medios 
cortopunzante 
cortopunzante 
cortopunzante 
cortopunzante 


cortopunzante 


sustancias 
tóxicas 


sustancias 
tóxicas 


sustancias 
tóxicas 


sin armas 


sin armas 


1047249 rows x 6 columns 


colombian acc 


genero grupo etario 


femenino 


masculino 


masculino 


femenino 


femenino 


masculino 


femenino 


masculino 


masculino 


masculino 


adultos 


adultos 


adultos 


adultos 


adultos 


adultos 


adolescentes 


adultos 


adultos 


adultos 


Transform the categorical values into numerical values. 


CATEGORICAL COLUMNS - 


df1[column] = dfi[column].astype('category').cat.codes 
dfi[column] = df1[column].astype('float64') 


descripción conducta 
lesiones personales 
lesiones personales 
lesiones personales 
lesiones personales 


lesiones personales 


lesiones personales 


lesiones personales 


lesiones personales 


lesiones personales 


lesiones personales 


cantidad 


['departamento','armas medios','genero','grupo etario','descripc 
# Iterate with each object type column and transform it in categorical type to obtaine 
for column in CATEGORICAL COLUMNS: 


We get the Float values for each column, Now we can normalize the data in order to feed the 


model. 


df1.info() 


«class 'pandas.core.frame.DataFrame' > 

RangeIndex: 1047249 entries, @ to 1047248 
Data columns (total 6 columns): 
Non-Null Count 


+ Column 

0 departamento 1047249 
1 armas medios 1047249 
2 genero 1047249 
3 grupo etario 1047249 
4 descripción conducta 1047249 


5 cantidad 


1047249 


dtypes: float64(5), int64(1) 


memory usage: 


To normalize the data we need to get some values from the data 


47.9 MB 


non-null 
non-null 
non-null 
non-null 
non-null 
non-null 
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Data normalization is performed to ensure that all variables are on the same scale. This is done 


to avoid variables with higher numerical values having a disproportionate weight. The formula 


for normalization is as follows: 


z— 


Lnorm = 


Tmean 


std 


Where z is the original variable, £norm is the normalized variable, Zmean is the minimum value 


of the variable and std is the standard deviation of the variable. 


train stats = df1.describe() 


train_ 


count 
mean 
std 
min 
25% 
50% 
75% 


max 


stats 


departamento 
1.047249e+06 
1.558220e+01 
9.722598e+00 
0.000000e +00 
6.000000e+00 
1.500000e+01 
2.600000e+01 


3.100000e+01 


def norm(x): 


return (x - train stats.loc['mean']) / train stats.loc['std'] 


df2 = 
df2 


df3 
df3 


norm(df1) 


armas medios 
1.047249e+06 
7.022941e+00 
4.026562e+00 
0.000000e+00 
4.000000e+00 
5.000000e+00 
1.300000e+01 


1.400000e+01 


genero 


1.047249e+06 


6.747555e-01 


5.732182e-01 


0.000000e +00 


0.000000e+00 


1.000000e+00 


1.000000e+00 


2.000000e+00 


grupo etario descripción conducta 


1.047249e+06 


1.057421e+00 


5.896843e-01 


0.000000e+00 


1.000000e+00 


1.000000e+00 


1.000000e+00 


3.000000e+00 


df.drop(columns-'fecha hecho').to numpy() 


norm(df1) 
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1.047249e- 06 


7.147679e-01 


4.515251e-01 


0.000000e+00 


0.000000e+00 


1.000000e+00 


1.000000e+00 


1.000000e+00 


cantidad 
1.047249e+06 
1.617188e+00 
2.163696e+00 
1.000000e+00 
1.000000e+00 
1.000000e+00 
1.000000e+00 


1.140000e+02 
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departamento armas_medios 


1047244 


1047245 


1047246 


1047247 


1047248 


1047249 rows x 6 columns 


-1.499826 


-1.499826 


-1.499826 


-1.499826 


-1.294119 


-0.574147 


0.145825 


1.277210 


-0.368441 


-0.368441 


-0.502399 


-0.502399 


-0.502399 


-0.502399 


-0.502399 


1.236057 


1.236057 


1.236057 


0.739355 


0.739355 


colombian_acc 


genero grupo etario descripción conducta 


-1.177135 


0.567401 


0.567401 


IMANES 


-1.177135 


0.567401 


215177135 


0.567401 


0.567401 


0.567401 


-0.097376 


-0.097376 


-0.097376 


-0.097376 


-0.097376 


-0.097376 


-1.793198 


-0.097376 


-0.097376 


-0.097376 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


0.631708 


cantidad 


0.176925 


-0.285247 


-0.285247 


-0.285247 


0.176925 


-0.285247 


-0.285247 


-0.285247 


-0.285247 


-0.285247 


Now, We need to find the K number, Kmeans algorithm needs to number of cluster to create 


the model, for this reason we use the elbow method to find the K number. 


Elbow Method 


I'll find th n cluster that better fit to the data 


kmeans kwargs = { 
"init": "random", 
EDT EO a Ae) 

"random state": 1, 


} 


#create List to hold SSE values for each k 


sse = [] 


for k in range(1, 20): 


kmeans = KMeans(n clusters-k, **kmeans_kwargs ) 
kmeans. fit (df3) 
sse.append(kmeans.inertia_) 


#visualize results 
plt.plot(range(1, 20), sse) 


plt.xticks(range(1, 20)) 


plt.xlabel( "Number of Clusters") 
plt.ylabel("SSE") 


plt.show() 
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le6 


SSE 


1 2 3 4 5 6 7 8 9 1011 12 13 14 15 16 17 18 19 
Number of Clusters 


| use Kneed to detect the optimal cluster to build the model, this method allowed me find out 
the optimal number of cluster gave me as result 6 clusters, Now, | can build the model 


from kneed import KneeLocator 
cost knee c3 - KneeLocator( 
x- range(1,20), 
y=sse, 
S-0.1, curve="convex", 
direction-"decreasing", online=True) 


K_cost_c3 = cost_knee_c3.elbow 
print('Elbow at K z',f'(K cost c3:.0f) clusters') 


Elbow at K = 6 clusters 


Build the model 


bluid the model with the cluster that Kneed gave me as result above 


# Construir modelo 

from sklearn.cluster import KMeans 

km = KMeans(init="k-means++", n clusters-6, max iter-10000,n init-20,algorithm-'elkan' 
km.fit(df3) 


KMeans(algorithm='elkan', max iter-10000, n clusters-6, n init-20) 
km.labels  Zclusters 
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array([0, 3, 3, ..., 5, 3, 3]) 


km.cluster centers  £ centroids 


array([[-0.58740587, -0.48161008, -1.17713547, -0.239587 , 0.62796963, 
-0.08132174], 
[ 0.09686951, 1.38980442, -0.06192137, -0.10393527, -1.58057542, 
-0.04963278], 
[-0.07873142, -0.43652053, 2.27684899, 3.29421016, 0.62961823, 
-0.05165761], 
[-0.79556882, -0.52470725, 0.56796301, -0.21311655, 0.62777335, 
-0.09076381], 
[-0.25868609, -0.28856714, -0.02922348, 0.14320751, 0.31397215, 
7.96882627], 
[ 1.03680752, -0.65610897, -0.02149409, -0.24372633, 0.63073792, 
-0.0790793 ]]) 


kmeans.predict(X-df3, sample weight-5) 
array([ 5, 3, 3, ..., 2, 11, 11]) 

# Create the new data frame with cluster 
cluster map = pd.DataFrame() 


cluster map['data index'] = df1.index.values 
cluster map['cluster'] = km.labels 


cluster map 


data index cluster 


0 0 0 
1 1 3 
2 2 3 
3 3 0 
4 4 0 
1047244 1047244 3 
1047245 1047245 0 
1047246 1047246 5 
1047247 1047247 3 
1047248 1047248 3 


1047249 rows x 2 columns 


groups = pd.concat([df.reset_index(),cluster_map],axis=1) # concatenate this dataframe 
groups = groups.drop(columns-['data index','index']) 
groups 
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fecha hecho departamento municipio armas medios genero grupo etario descripciór 

O 2010-01-01 antioquia girardota  cortopunzante femenino adultos lesiones 

1 2010-01-01 antioquia girardota ` cortopunzante masculino adultos lesiones 

2 2010-01-01 antioquia mutatá  cortopunzante masculino adultos lesiones 

3 2010-01-01 antioquia necoclí cortopunzante femenino adultos lesiones 

4 2010-01-01 atlántico PU cortopunzante femenino adultos lesiones 
1047244 2022-05-03 cesar valledupar SE masculino adultos lesiones 

(ct) toxicas 

1047245 2022-06-16 huila oporapa xn femenino adolescentes lesiones 
1047246 2022-04-17 tolima  ¡bagué (ct) be masculino adultos lesiones 
1047247 . 2022-03-30  cundinamarca cota sin armas | masculino adultos lesiones 
1047248 2022-06-10  cundinamarca guaduas sin armas masculino adultos lesiones 


1047249 rows x 11 columns 


| notice each result of the cluster with categorical data 


groups[groups.cluster == 0].describe(include-'object') # cluster 1 
departamento municipio armas medios genero grupo etario descripción conducta 
count 197866 197866 197866 197866 197866 197866 
unique 25 827 15 1 4 2 
: bogotá d.c. ; ] 
top  cundinamarca (ct) contundentes femenino adultos lesiones personales 
freq 36764 17427 104480 197866 171363 197532 
groups[groups.cluster == 1].describe(include-'object') # cluster 2 
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departamento municipio armas medios genero grupo etario descripción conducta 
count 296788 296788 296788 296788 296788 296788 
unique 32 982 7 3 4 2 
top valle cali (ct) vehiculo masculino adultos lesiones culposas 
freq 43426 16713 262253 189665 247443 296462 
groups[groups.cluster == 2].describe(include='object') # cluster 3 
departamento municipio armas medios genero grupo etario descripción conducta 
count 57226 57226 57226 57226 57226 57226 
unique 32 993 14 3 2 2 
2 bogota d.c. no i 
top  cundinamarca contundentes no reporta lesiones personales 
(ct) reporta 
freq 9842 3047 31551 56345 57224 57172 
groups[groups.cluster == 3].describe(include='object') # cluster 4 
departamento municipio armas medios genero grupo etario descripción conducta 
count 223451 223451 223451 223451 223451 223451 
unique 21 699 15 2 3 2 
. bogotá d.c. : e 
top ` cundinamarca (ct) contundentes masculino adultos lesiones personales 
freq 49086 22484 94024 223379 196780 223054 
groups[groups.cluster == 4].describe(include='object') # cluster 5 
departamento municipio armas medios genero grupo etario descripción conducta 
count 9389 9389 9389 9389 9389 9389 
unique 27 125 7 3 4 2 
. bogotá d.c. . . 
top | cundinamarca (ct) contundentes masculino adultos lesiones personales 
freq 7480 7456 3992 4828 8690 8042 


In the plot bellow notice the cluster 4 got less than the other cluster, cluster 1 got more than 
other clusters. 


groups['cluster'].value counts().plot(kind- bar') 


«AxesSubplot:» 
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In [ ]: def denorm(x): 


return (x * train stats.loc['std'] + train stats.loc['mean']) 
#df4 = df3.drop(columns-'cLuster ') 


df4 - denorm(df3) 


df4 = pd.concat([df4,cluster map],axis-1).drop(columns-'data index') 


In [ ]: dfa 


Out[ ]: 


1047244 
1047245 
1047246 
1047247 


1047248 


1047249 rows x 7 columns 
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# Inspect the categorical variables 
df.select dtypes( 'object').nunique() 


departamento 
municipio 

armas medios 

genero 

grupo etario 
descripción conducta 
dtype: int64 


# Check missing value 
df4.isna().sum() 


departamento 
armas_medios 

genero 

grupo_etario 
descripción conducta 
cantidad 

cluster 

dtype: int64 


df region = pd.DataFrame(groups['departamento'].value counts()).reset index() 


32 


1023 


GODT o GO Go Gi 


15 


Nu 


df region['Percentage'] = df region['departamento'] / groups['departamento'].value cot 


df region.rename(columns = ('index':'departamento', 


df region 


'departamento':'Total'j, inplace 
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30 


31 


departamento 
cundinamarca 
valle 
antioquia 
santander 
tolima 
boyaca 
atlantico 
huila 
risaralda 
nariño 
bolívar 
meta 
caldas 
norte de santander 
cauca 
córdoba 
quindío 
sucre 
magdalena 
cesar 
casanare 
guajira 
caquetá 
chocó 
arauca 
putumayo 
san andrés 
amazonas 
guaviare 
vichada 
guainía 


vaupés 


Total 


134439 


120891 


105105 


85237 


52423 


48113 


43755 


42801 


38702 


35336 


33979 


33952 


32739 


30889 


28050 


24939 


22149 


21188 


21103 


19384 


15282 


13141 


11457 


7716 


7683 


5121 


4511 


2879 


2108 


1115 


922 


740 


Percentage 
0.128373 
0.115437 
0.100363 
0.081391 
0.050058 
0.045942 
0.041781 
0.040870 
0.036956 
0.033742 
0.032446 
0.031847 
0.031262 
0.029495 
0.026784 
0.023814 
0.021150 
0.020232 
0.020151 
0.018509 
0.014593 
0.012548 
0.010940 
0.007368 
0.007336 
0.004890 
0.004307 
0.002749 
0.002013 
0.001065 
0.000880 


0.000707 
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df region = df region.sort values('Total', ascending = False).reset index(drop = True) 


df region.plot.bar(x-'departamento', yz'Percentage') 


<AxesSubplot: xlabel='departamento' > 


EE Percentage 
0.12 


0.10 


0.08 


0.06 


0.04 


0.02 


0.00 


2 


nariño 
sucre 
caquetá 
vaupés 


magdalena 
cesar 


valle 

huila 

risaralda 

meta 

caldas 

2 norte de santander 
casanare 


antioquia 
tolima 


boyacá 
chocó 
arauca 


Santander 
atlántico 
bolívar 
cauca 
córdoba 
quindío 
guajira 
putumayo 
san andrés 
amazonas 
guaviare 
vichada 
guainía 


cundinamarca 


partamento 


This graph shows us the percentage of accidents and personal injuries by department. We can 
see that Cundinamarca still holds the highest percentage, followed by Antioquia and Valle del 
Cauca. These departments are the most populous ones in the country, so it is expected that 
they would have a higher number of accidents and personal injuries. However, it is still 
concerning to see that the percentage of accidents and personal injuries is quite high in these 
areas. 


It is important to note that some departments, such as Vaupés and Guainía, have very low 
percentages. These are remote and sparsely populated regions in the country, so it is not 
surprising that they have lower numbers of accidents and personal injuries. 


Overall, this graph gives us an idea of the distribution of accidents and personal injuries by 
department in Colombia. It can be a useful tool for policymakers and organizations to identify 
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areas where more attention and resources are needed to reduce the incidence of accidents and 


personal injuries. 


# Cluster interpretation 
groups.groupby('cluster').agg( 


{ 
'departamento': lambda x: x.value_counts().index[0], 
'municipio': lambda x: x.value counts().index[0], 
'genero': lambda x: x.value_counts().index[0], 
'armas medios': lambda x: x.value_counts().index[0], 
'grupo etario': lambda x: x.value_counts().index[0], 
'descripción conducta': lambda x: x.value counts().index[0], 
'cantidad': 'mean', 
j 
).reset index() 
cluster departamento municipio genero armas medios grupo etario descripción conducta  : 
; bogotá ; . 
0 O  cundinamarca d.c. (ct) femenino contundentes adultos lesiones personales 
1 1 valle cali (ct) masculino vehiculo adultos lesiones culposas 
: bogotá no C 
2 A  cundinamarca contundentes no reporta lesiones personales 
d.c. (ct) reporta 
5 bogotá : : 
3 3  cundinamarca kE masculino contundentes adultos lesiones personales 
; bogotá e . , 
4 4 cundinamarca d.c. (ct) masculino contundentes adultos lesiones personales 1: 
5 5 valle cali (ct) masculino contundentes adultos lesiones personales 
» 
Z = groups.copy() 
Z = Z.drop(columns-['dia semana','mes','fecha hecho','municipio']) 
Z - pd.get dummies(Z) 
7 
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cantidad departamento_amazonas departamento antioquia departamento arauca departan 


0 2 0 1 0 
1 1 0 1 0 
2 1 0 1 0 
3 1 0 1 0 
4 2 0 0 0 
1047244 1 0 0 0 
1047245 1 0 0 0 
1047246 1 0 0 0 
1047247 1 0 0 0 
1047248 1 0 0 0 


1047249 rows x 63 columns 


PCA 


In order to vizualize the results of the clusters that we found above, We have use the PCA 
method to reduce the dimensionality of the data. This algorithm allowed us vizualize the data in 


3 dimensions. 


from sklearn.decomposition import PCA 


# Obtención de componentes principales 
pca - PCA(n components-3) 

pca.fit(Z) 
transformada-pca.transform(Z) 


# Código de visualización 


print("Explained Variance for each component:", pca.explained variance ) 
print("Explainded Variance Ratio for each component:", pca.explained variance ratio ) 


Varianza explicada por cada componente: [4.6938254 0.91859283 0.56984193] 
Proporción de varianza explicada por cada componente: [0.55876213 0.10935108 0.067835 
EEN 


che vëlleg c {Os "real. ys "en" Ae (cel? as "real" Ale c 3g "S p 


groups['cluster'] = groups['cluster'].replace(dict cluster) 
groups 
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fecha hecho departamento municipio armas medios genero grupo etario descripciór 

O 2010-01-01 antioquia girardota  cortopunzante femenino adultos lesiones 

1 2010-01-01 antioquia girardota cortopunzante masculino adultos lesiones 

2 2010-01-01 antioquia mutatá  cortopunzante masculino adultos lesiones 

3 2010-01-01 antioquia necoclí cortopunzante femenino adultos lesiones 

4 2010-01-01 atlántico S cortopunzante femenino adultos lesiones 
1047244 2022-05-03 cesar n E masculino adultos lesiones 
1047245 2022-06-16 huila oporapa is femenino adolescentes lesiones 
1047246 2022-04-17 tolima  ¡bagué (ct) be masculino adultos lesiones 
1047247 . 2022-03-30  cundinamarca cota sin armas | masculino adultos lesiones 
1047248 | 2022-06-10 | cundinamarca guaduas sin armas masculino adultos lesiones 


1047249 rows x 11 columns 


# Scatter 


# Import Libraries 
from mpl toolkits.mplot3d import axes3d 
import matplotlib.pyplot as plt 


# create figure 

fig - plt.figure() 

# Create 3D 

ax1 - fig.add subplot(111, projection-'3d') 


Defining the data 

transformada[:,0] 
transformada[:,1] 
transformada[:,2] 


HN ke x + 
D D 


+ 


Defining the colors 
#color = df['genero'].map({'masculino':'b','femenino':'r', 'no reporta':'g'}) 
cole Ex faepe "cun leu eal? o Plo 5 "cO erie, Tes erry "enl: yw" "c oti, "e 


# make the scatter plot 
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ax1.scatter(x, y, Z, s-30, c=color, alpha-0.9, edgecolors-'k', linewidths=0.5) 
# show the plot 


plt.show() 


100 


This is the result of the clustering algorithm applied to the data. The graph shows how the 
algorithm grouped the data and how it is distributed in 3 dimensions. We can observe that 
there are clear clusters with a significant amount of data points that are tightly packed together, 
while other points seem to be more spread out. The clustering algorithm can be a useful tool to 
identify patterns and groupings in data that might not be immediately apparent, providing 
insights and aiding decision-making processes. However, it is important to keep in mind that 
the results of the clustering algorithm are only as good as the data and the chosen parameters, 
and may require further analysis and refinement. 


Conclusions 


The analysis of accidents in Colombia has allowed us to delve deeper into the behavior of 
accidents, discover how they affect different ages and genders, and analyze the type of 
weapons involved. Machine learning algorithms have helped us to identify patterns and trends 
in accidents, and provide decision-makers with important data to help create effective programs 
and policies to improve safety throughout the country. This initiative offers an excellent 
opportunity to contribute to Colombia's safety and improve the lives of its citizens. Our team is 
committed to working on this initiative to ensure that this important task is carried out 
efficiently and effectively. We are convinced that our work will make a significant contribution to 


improving the security and quality of life of Colombians. 
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Based on the information and data presented, some conclusions that could be drawn from this 
project are: 


e Blunt and sharp objects, as well as vehicular accidents, are the most common causes of 
personal injuries and accidents in Colombia. 

* The majority of these accidents seem to be caused by unintentional events, although 
intentional acts of violence cannot be ruled out. Men, particularly adult men, are more likely 
to be affected by personal injuries and accidents than women. 

e The use of blunt objects seems to affect women more than sharp objects. 

e Violence in the streets and intrafamily violence could be important factors contributing to 
personal injuries and accidents. 

e There is a need for further research and data analysis to better understand the causes and 
circumstances surrounding personal injuries and accidents in Colombia. 

e These findings suggest the importance of implementing policies and measures aimed at 
preventing accidents and reducing violence in the country. 


Additionally, it highlights the need for more comprehensive data collection and analysis to 
better understand the causes of injuries and accidents, particularly those related to violence and 
criminal activity. Policymakers and public health officials can use this information to develop 
targeted interventions and preventive measures to reduce the incidence of injuries and improve 


the overall health and safety of the population. 


Overall, this project underscores the importance of data-driven approaches to public health and 
safety, as well as the potential of data visualization tools to communicate complex information 
in an accessible and actionable way. 


| invite you to take a look at the dashboard | designed on Power BI, where you can explore and 
analyze the data of accidents and personal injuries in Colombia. You will find several interactive 
visualizations that will allow you to dig deeper into the information and understand the patterns 
and trends of this problem in the country. To access the dashboard, please follow this link: 
https://onx.la/71e87. | hope you find it interesting and informative. Let me know if you have any 


questions or feedback! 


Some Suggestions 


Include more data: The dataset used in this project is limited to the years 2014-2018 and to 
reported cases only. To obtain a more comprehensive understanding of the issue, it would be 


helpful to gather data from a wider time frame and include unreported cases as well. 


Include more variables: While the current dataset provides valuable information on the type of 
accidents and injuries that occur in Colombia, including more variables such as the location and 
time of day of the incidents could provide further insights into the issue. 


Further analysis: The current project provides a good overview of the trends and patterns of 


accidents and injuries in Colombia. However, conducting further analysis using advanced 
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statistical techniques could uncover more complex relationships between variables and provide 
a more nuanced understanding of the issue. 


Collaboration with local authorities: To address the issue of accidents and injuries in Colombia, it 
would be helpful for researchers to collaborate with local authorities to develop and implement 


targeted interventions and policies aimed at preventing accidents and injuries. 


Future approaches 


e Conducting a more in-depth analysis of the causes and circumstances behind the injuries 
and accidents in each department, as well as their distribution by gender and age. This 
could provide more insights into the root causes of the injuries and help identify potential 


interventions. 


e Examining the economic costs associated with injuries and accidents, including medical 
expenses, lost income, and disability costs. This could help policymakers prioritize 


interventions and allocate resources more effectively. 


e Studying the effectiveness of existing policies and interventions aimed at reducing injuries 
and accidents, and identifying areas for improvement. This could involve evaluating specific 
policies, such as traffic safety laws or regulations on the use of weapons, and analyzing 
their impact on injury rates. 


* Using machine learning algorithms to predict injury rates in different regions of the country 
based on demographic, economic, and social indicators. This could help identify regions at 


risk and target interventions more effectively. 


e Collaborating with local communities and organizations to develop tailored interventions 
that address specific needs and challenges. This could involve working with community 
leaders, healthcare providers, and government agencies to develop and implement 


evidence-based programs and policies. 
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